ABSTRACT

A multiparameter, programmable model was developed to
examine the interactive influence of certain parameters on the
probability of deciding that an examinee had attained a specified
degree of mastery. It was applied within the simulated context of
performance testing of military trainees. These parameters included:
(1) the number of assumed mastery states--master, nonmaster, and
perhaps intermediate (likely to soon achieve mastery); (2) the prior
distribution of scores from similar examinee groups; and (3) the
number of test trials or items administered. The results of several
simulations showed that the degree of confidence that a decisionmaker
can have about the testee's mastery is markedly affected by the
values for the three parameters, and the effects of their
combination. Using the Bayesian model, test length and costs could be
reduced--as long as the prior information was accurate and valid for
the particular group of examinees. Results of the simulation also
showed that a test may be too short to be of decision-making value.
(Author/GDC)
FOREWORD



The research presented in this report was conducted under Project
METTEST (Methodological Issues in Criterion-Referenced Testing), under
the auspices of the Unit Training and Evaluation Systems (UTES) Technical
Area of the Army Research Institute for the Behavioral and Social
Sciences (ARI). The goal of Project METTEST is to provide quantitative
methods for evaluating unit proficiency. The means for achieving this
goal include basic research in test construction methodology, measurement
and scaling models, and decisionmaking implications of test score
interpretation. ARI Technical Paper 306 is the initial publication on
the project.

Related, ongoing programs within the UTES Technical Area include 
evaluation of small combat units under simulated battlefield conditions
(REALTRAIN), qualification of tank gunnery crews and revision of Table
VIII (IDOC), and improving the standardization and reliability of the
Army Training and Evaluation Program (ARTEP).

Anticipated future research under Project METTEST includes the 
development of a computer-programed model for performance evaluation
and several additional 6.1 basic research grants for the development
of measurement, scaling, scoring, decisionmaking, and quality control
models for use in performance evaluations when criterion-referenced 
testing procedures are employed. 

The present research was conducted by personnel of the UTES Tech- 
nical Area as an in-house research project, under Army Project 
2Q762722A764. G. Gary Boycan supplied a key creative insight into the
"misclassification problem." An earlier version of this paper has been
printed in the Proceedings of the October 1976 Naval Training Equipment
Center (NTEC) Conference. 




JOSEPH ZEIDNER
Technical Director
BRIEF



Requirement:

The educational decisionmaker typically wants to know if a student
can perform a job at some prespecified level of acceptability. If the
student's test score is above the minimal passing standard, the individual
may be classified as a master; otherwise, as a nonmaster. The
present paper describes a mathematical model that provides maximal
classification accuracy with the least number of test items or trials.



Classification Model:

Estimates of several variables must be provided as input to the
model, which is derived from Bayes' Theorem. Two of these variables
are probability estimates: the prior expectation of selecting a master
from the student population and the conditional probability that a known
master would answer a randomly selected test item correctly. Two other
variables, the minimal passing standard and the number of test items,
are under some degree of control by the tester. Furthermore, the effect
of the latter two variables is an interaction, because the model shows
that classification accuracy is not invariant over different test lengths
when the same percent correct score is attained by examinees.

Findings:

A computer simulation of the model demonstrated the effects of
simultaneously varying five variables on classification accuracy. The
arbitrary nature of defining the criterion for mastery as a percent
correct test score was critically evaluated. Testing may be irrelevant
in situations where the test length is less than the minimal number
of items.

Utilization of Findings:

The model shows explicitly the risks involved in using a given
length of test once the tolerance for misclassification error has been
specified by the examiner.
A BAYESIAN METHOD FOR EVALUATING
TRAINEE PROFICIENCY



INTRODUCTION

No instructional system is complete without a strong testing com- 
ponent. Any student who begins an instructional program should be 
able to achieve all the objectives that the program was designed to 
teach. However, some students may require remedial or other supple- 
mentary instruction to master all of the objectives, even though the 
program was carefully developed. Furthermore, during the development 
of the instruction, test data from prospective students are required, 
first to revise and later to validate the instruction. To support the
instructional development activities and to make decisions about the
abilities of students who have completed instruction, a powerful testing
program is necessary.

The final desired output of a test for a given examinee is infor- 
mation that can pinpoint ability to do whatever is required by an ob- 
jective. That is, the examiner observes a test score and then infers 
the ability of the examinee. This paper outlines a "Bayesian" method 
for drawing such inferences. It also discusses and illustrates the 
adequacy of the method as a function of the number of test items
administered and the effects of the tester's beliefs about the quality
of the examinee population on the inferences drawn. 

Using the Bayesian method, the tester hypothesizes varying numbers
of ability groups so that the classification of examinees into
these ability groups is most useful to the overall instructional system.
For example, the simplest case is to classify examinees into two
groups, the first group containing those who have mastered the objective,
and the second containing those who have not. Alternatively, one
could hypothesize three groups, consisting of masters, nonmasters, and
an intermediate group containing people whose skills are almost satisfactory
and who could be brought up to the mastery level with relatively
little additional instruction. The Bayesian model presented in this
paper explores up to three levels of mastery, although this number
could easily be expanded. The model also explores the effects on
decisionmaking (correctly classifying masters and nonmasters) if more
than two ability levels have been hypothesized but are then collapsed 
to form just two groups — masters and nonmasters. 



TRAINING TO MASTERY 

Ideally, the educational decisionmaker wants to know if a person 
(student, trainee) can do a job at some prespecified level of accepta- 
bility. A student who scores above the minimal passing standard on a 
test may be classified as a master; if the score is below the minimal 
passing score, the student would be termed a nonmaster. But since data
always have some error variability, misclassifications are likely to
occur.

Ideally, the probability of a true positive should be much greater than
that for a false positive, and the probability for a true negative
should be much greater than that for a false negative.

To evaluate how well our testing program achieves this goal, we
want to be able to infer as accurately as possible the conditional
probability of the mastery (or nonmastery) state, given the test score
data, p(M1|T), p(M2|T). Our first problem is: on what amount of data is
this probabilistic inference based? Suppose that the passing
standard was 80% of the test items correct. A student with 33 out of
40 items correct would pass and would be classified as a master. Now
suppose that on another form of the test (or a test given over the same
material by another instructor), another student gets 25 out of 30 test
items correct. This student would also have met the 80% correct criterion
and would be classified as a master. The model presented in
this paper will show that p(M1|T) varies systematically with the
number of test items, along with the minimal percentage correct for
passing.
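
The point that two scores meeting the same percentage standard need not carry the same evidence can be sketched numerically. The following is a minimal illustration, not the report's program: it works on the log-odds scale (posterior odds = prior odds x likelihood ratio), and the parameter values (prior of 0.9, item probabilities of 0.9 for masters and 0.6 for nonmasters) are assumed here purely for illustration, since this passage does not specify them.

```python
import math


def posterior_from_odds(correct, items, prior, p_master, p_nonmaster):
    """p(M1|T) for a two-state model, computed on the log-odds scale."""
    wrong = items - correct
    log_prior_odds = math.log(prior / (1.0 - prior))
    # Each correct item and each wrong item shifts the log odds by a fixed amount.
    log_likelihood_ratio = (
        correct * math.log(p_master / p_nonmaster)
        + wrong * math.log((1.0 - p_master) / (1.0 - p_nonmaster))
    )
    log_odds = log_prior_odds + log_likelihood_ratio
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative values only (assumed, not taken from this passage).
PRIOR, P_MASTER, P_NONMASTER = 0.9, 0.9, 0.6

for correct, items in [(33, 40), (25, 30), (4, 5)]:
    p = posterior_from_odds(correct, items, PRIOR, P_MASTER, P_NONMASTER)
    print(f"{correct}/{items} correct: p(M1|T) = {p:.3f}")
```

All three scores meet an 80% standard, yet under these assumed values the inferred probability of mastery differs with the number of items behind the score.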

We may also ask: How is the accuracy of inference about mastery
affected by postulating more than two states (mastery and nonmastery)?
And can the data from various states be combined without seriously affecting
the final p(M1|T) inference? For example, suppose that there
are intermediate states of partial mastery. The following decision
model shows that p(M1|T) can be more validly estimated when the mastery
states are processed independently, but that educational decisionmakers
will not sacrifice very much classification accuracy if indeed they do
dichotomize multichotomous data. We suggested that defining an intermediate
group which required minimal remediation might be useful for
some instructional systems. The model shows that the probability of
being in the mastery group when indeed the datum was a test score
obtained by a master will be increased if the other data are processed
independently. The concept of "independent processing" requires that
all nonmastery groups maintain their integrity, rather than being aggregated
into one generalized nonmastery group.



CONSTRUCTION OF THE MODEL

Bayes' Theorem



The statistical model which we have applied for classifying students
into mastery and nonmastery groups, given their test scores, is based
upon a form of Bayes' Theorem:



\[
p(M_1 \mid T) = \frac{p(T \mid M_1)\,p(M_1)}{p(T \mid M_1)\,p(M_1) + p(T \mid M_2)\,p(M_2)}
\]

Here we assume that the two states of nature (master and nonmaster)
are mutually exclusive and collectively exhaustive, and that T is the
test score observed. We also assume that the test is dichotomously
scored and that the items are independent. A correct response is denoted
"1," an incorrect response is denoted "0," and the total test
score is simply the number of correct responses. What we seek to find
is the term on the left, the probability that a given student is a master,
given his test score. To find it, we need an estimate
of the prior probability of mastery (p(M1)) in the population of students
from which this student was drawn. The prior probability of mastery
can be considered the proportion of students in the examinee population
we think are masters. For example, if our instruction were very
good, the prior probability of mastery would be high, and most of the
students who completed the instruction should have mastered the objective.
The actual number specified for the prior probability of mastery
may be an informed guess based on experience, or it may be based on the
empirical results of tests given to previous classes of similar students.

We must also estimate the conditional probability of a certain
test score, given that the student who receives that score is a master.
For example, if only one item is administered, the conditional probability
of a score of one correct, given that the student was a master,
is simply the probability that a master responds correctly. We may
estimate this conditional probability empirically based on previous
student groups, or we may provide a best guess as to how well masters
perform, or this conditional probability may reflect a minimal standard
of achievement. We shall show how p(M|T) will vary as a function
of the prior expectations of the tester, number of test items, and conditional
probabilities, p(T|M), after an example to illustrate the
computations.
Suppose that a student chosen at random from a trainee population
is given a criterion-referenced test (CRT) and that he passes the test.
Given the results of the test, what is the probability that the student
is indeed a master of that particular course of instruction? To
calculate the probability, we obtain the following information from
the educational expert who administered the CRT: The probability that
a master would obtain a passing score = .90, (p(T|M1) = .90); the probability
that a nonmaster would obtain a passing score = .05, (p(T|M2) =
.05); and the prior probability of randomly selecting a master from
this trainee population is equal to .70; that is, we believe that 70%
of this and similar previous trainee populations may be assumed to be
composed of masters. Substituting these values into the formula,

\[
p(M_1 \mid T) = \frac{.9 \times .7}{.9 \times .7 + .05 \times .3}
\]

equals .977. Hence, before the test score was available, the probability
that this student was a master was .70, but after a passing score
was observed, the probability that this person is a master has increased
to .977. (The probability of this student's being a nonmaster, given
the same passing score, p(M2|T), would be equal to 1 - .977 or .023.)
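
The arithmetic of this example can be checked with a few lines of code. This is only a sketch of the two-state form of Bayes' Theorem given earlier; the function and argument names are ours, chosen for illustration.

```python
def posterior_master(prior_master, p_pass_given_master, p_pass_given_nonmaster):
    """Two-state Bayes' Theorem: p(M1|T) after a passing score is observed."""
    prior_nonmaster = 1.0 - prior_master
    numerator = p_pass_given_master * prior_master
    denominator = numerator + p_pass_given_nonmaster * prior_nonmaster
    return numerator / denominator


# Values from the worked example: p(M1) = .70, p(T|M1) = .90, p(T|M2) = .05.
p_m1 = posterior_master(0.70, 0.90, 0.05)
print(round(p_m1, 3))        # 0.977
print(round(1.0 - p_m1, 3))  # 0.023
```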

To generalize the Bayesian approach to a wide variety of applications
in evaluating training effectiveness, two additions must be made
to the basic formula. These additions are the number of trials or items
on the test (N), and the number of hypothesized mastery states (S). The
derivation of the general Bayesian formula for this purpose was originally
presented by Hershman¹:



\[
p(M_i \mid T) = \frac{\prod_{j=1}^{N} p(M_i \mid t_j)}{p(M_i)^{N-1} \displaystyle\sum_{k=1}^{S} \frac{\prod_{j=1}^{N} p(M_k \mid t_j)}{p(M_k)^{N-1}}}
\]



In this formula, p(Mi|tj) equals the conditional probability of a person
in the ith mastery state getting the jth test item correct; p(Mi)
is the prior probability of the representation of the ith mastery state
in the student population (the percentage of students who are estimated
to be in the ith mastery state); and p(Mi|T) is the conditional probability
of a particular student being in the ith mastery state given his
total test score. A computational example showing how the formula is
applied for three mastery states is given in the appendix.

¹ Hershman, R. L. A Rule for the Integration of Bayesian Opinions.
Human Factors, 1971, 13, 255-259.
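
One way to read the general formula is sketched below. We assume here that p(Mi|tj) denotes the probability of state i after a single item j has been scored, so that the rule multiplies the item-by-item opinions and divides out the prior that would otherwise be counted N times. Under that assumption the rule gives the same answer as applying Bayes' Theorem once to the whole response string, which the sketch checks. The parameter values are illustrative and the function names are ours, not Hershman's.

```python
from math import prod


def item_posteriors(priors, p_correct, response):
    """p(Mi|tj): opinion about each state after one item scored 1 or 0."""
    likes = [p if response == 1 else 1.0 - p for p in p_correct]
    joint = [like * prior for like, prior in zip(likes, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def integrate_opinions(priors, per_item):
    """Combine N single-item opinions: product over items / prior**(N - 1)."""
    n = len(per_item)
    raw = [
        prod(post[i] for post in per_item) / priors[i] ** (n - 1)
        for i in range(len(priors))
    ]
    total = sum(raw)
    return [r / total for r in raw]


def direct_bayes(priors, p_correct, responses):
    """Bayes' Theorem applied once to the full string of item responses."""
    k, n = sum(responses), len(responses)
    joint = [pr * p ** k * (1.0 - p) ** (n - k) for pr, p in zip(priors, p_correct)]
    total = sum(joint)
    return [j / total for j in joint]


# Illustrative three-state values (assumed for this sketch).
priors = [0.5, 0.3, 0.2]
p_correct = [0.8, 0.6, 0.5]
responses = [1, 1, 1, 1, 0]  # 4 out of 5 correct

per_item = [item_posteriors(priors, p_correct, r) for r in responses]
print([round(x, 3) for x in integrate_opinions(priors, per_item)])
print([round(x, 3) for x in direct_bayes(priors, p_correct, responses)])
```

Because items are assumed independent and dichotomously scored, only the number of correct responses matters, not their order.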

Variables of Interest in the Present Simulation

In the typical situation for evaluating training proficiency, the
tester has some control over the number of items or trials that he will
include on a test. In a performance-based test, each trial may be rather
expensive (such as tank gunnery or field artillery, where each shell
costs over $100), and so the tester will be obliged to use a minimum
number of trials to meet his decisionmaking requirements. Consequently,
we examined the effect on p(M|T) when N took on values of 5, 10, 20, and
40 trials.

The tester also has responsibility for assigning reasonable values
to the prior probabilities of mastery, denoted as p(Mi), and to the conditional
probabilities of a known master (or nonmaster) getting a randomly
selected item correct, denoted as p(t|Mi). Values for both the
prior and conditional probabilities were systematically manipulated in
the present simulation.

The number of mastery states is a variable which the trainer and/or
tester may also set. In some measurements of trainee proficiency it
may be most appropriate to dichotomize on an all-or-none basis, whereas
other training evaluation contexts may suggest a "pass, give refresher
training, recycle failures through complete training" trichotomy. More
than three mastery states may of course be hypothesized, but the computations
in the present and all other models of proficiency evaluation
become extremely complex. (However, we are developing a computer program
that will handle up to five states of mastery.)

The dependent variable of main interest is the percent of items
answered correctly. The tester may decide that 70% is a passing score.
But the 70% value is not an absolute standard, since it is dependent
upon the number of test items and the prior and conditional probability
estimates. In the present simulation, three values of percent correct
observed scores were used: 60%, 70%, and 80%.
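
The design just described (four test lengths crossed with three observed percent-correct scores) can be laid out as a small grid. The sketch below is not the simulation program used for this report; the function names are ours and the parameter values passed at the bottom are illustrative. It also assumes that a percent score which does not correspond to a whole number of items (for example, 70% of 5 items) may be entered as a fractional count, which is our simplification.

```python
TEST_LENGTHS = (5, 10, 20, 40)
PERCENT_CORRECT = (60, 70, 80)


def posterior_master(correct, items, prior, p_master, p_nonmaster):
    """p(M1|T) for two mastery states and a score of `correct` out of `items`."""
    like_master = p_master ** correct * (1.0 - p_master) ** (items - correct)
    like_nonmaster = p_nonmaster ** correct * (1.0 - p_nonmaster) ** (items - correct)
    joint_master = prior * like_master
    return joint_master / (joint_master + (1.0 - prior) * like_nonmaster)


def simulate(prior, p_master, p_nonmaster):
    """Return {(percent, items): p(M1|T)} for every cell of the design."""
    table = {}
    for percent in PERCENT_CORRECT:
        for items in TEST_LENGTHS:
            correct = percent * items / 100  # may be fractional (assumption)
            table[(percent, items)] = posterior_master(
                correct, items, prior, p_master, p_nonmaster
            )
    return table


# Illustrative parameter set (assumed): p(M1) = 0.9, p(1|M1) = 0.9, p(1|M2) = 0.6.
grid = simulate(0.9, 0.9, 0.6)
for percent in PERCENT_CORRECT:
    row = "  ".join(f"N={n}: {grid[(percent, n)]:.3f}" for n in TEST_LENGTHS)
    print(f"{percent}% correct  {row}")
```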

Changes in p(M|T), Assuming Two Mastery States

The fundamental purpose of the present study was to investigate how
the probability of mastery classification changes as a function of the
simultaneous manipulation of up to four parameters (independent variables).
The scope of the study is not exhaustive, since only several
values of each of the four variables were used. However, some general
trends do seem to emerge, as can be seen in the following figures.
Figures 1, 2, and 3 show the results of applying the model to a
situation in which only two mastery groups (mastery and nonmastery)
have been hypothesized. The data points represent the probability that
a trainee is a master, given (conditional upon) his total test score,
p(M|T). The lines show how p(M|T) changes as a function of variations
in the four parameters: prior expectation of mastery, the percentage
correct items observed, the conditional probabilities of both
a master and a nonmaster responding correctly to an item, and the number
of items comprising the test.

Figure 1 represents a testing situation in which the training was
of extremely high quality, since the proportion of masters in the trainee
population was assumed to equal 0.9. That is, p(M1) = 0.9. Figure
1A portrays the situation in which both masters and nonmasters have
attained a rather high degree of proficiency, since the probability of
a master responding correctly to any given item is 0.9, and the probability
of a nonmaster responding correctly is 0.6. If a person scores
80% on a 5-item test, the probability that he is a master is approximately
.91. This probability drops to .65 if a 60% score on 5 items
(3 out of 5 correct) is obtained. Note that when the test length is
increased to 40 items, an 80% score (32 correct) produces a .99 probability
of mastery. However, a score of 60% (24 correct) yields an essentially
zero probability of mastery. The effect of the test length
variable on classification accuracy is dramatic. If the p(M|T) had to
be at least 0.5 for a person to be called a master, then scores of 60%
on a 5-item test would lead to mastery classification. But a 60% score
on a 40-item test would lead to nonmastery classification.
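
The decision rule described for Figure 1A can be sketched as follows, using the parameter values stated in the text (p(M1) = 0.9, p(1|M1) = 0.9, p(1|M2) = 0.6) and the 0.5 cutoff. The function names are ours. Computed values agree with those read from the figure to within about 0.01.

```python
def p_master(correct, items, prior=0.9, p1=0.9, p2=0.6):
    """p(M1|T) under the Figure 1A parameter values given in the text."""
    like_master = p1 ** correct * (1.0 - p1) ** (items - correct)
    like_nonmaster = p2 ** correct * (1.0 - p2) ** (items - correct)
    joint_master = prior * like_master
    return joint_master / (joint_master + (1.0 - prior) * like_nonmaster)


def classify(correct, items, cutoff=0.5):
    """Call the examinee a master when p(M1|T) reaches the cutoff."""
    return "master" if p_master(correct, items) >= cutoff else "nonmaster"


for correct, items in [(4, 5), (3, 5), (32, 40), (24, 40)]:
    print(
        f"{correct}/{items}: p(M1|T) = {p_master(correct, items):.3f}"
        f" -> {classify(correct, items)}"
    )
```

The same 60% score leads to opposite classifications on the 5-item and 40-item tests.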

Figure 1A also illustrates the effect of "prior beliefs" on p(M|T).
One might suppose intuitively that the chances were much higher that a
person who obtained a score of 60% (even from a 5-item test) came from
a population whose probability of correctly answering an item was 0.6
than from a population whose probability of answering an item correctly
was 0.9. However, the relative proportions of the two groups (expressed
as prior belief in mastery and nonmastery, or p(M1) = .9 and p(M2) = .1,
respectively) are such that the probability of a person being in the
mastery state is approximately 0.65 for a score of 3 correct (60%) on
a 5-item test. Only by increasing the number of test items can the
strong prior bias in favor of the mastery decision be reversed. Figures
2A and 3A show what happens when prior beliefs are not so heavily
biased in favor of mastery. In neither case is the probability of being
in the mastery state above 0.5 for scores of less than 80%. But Figure
1A suggests that when prior beliefs heavily favor one group over the
other, longer length tests should be used. Otherwise, the amount of
data may not be sufficient to force a change in the originally held
prior beliefs.
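
How long a test must be before the data override a strong prior can be estimated by a simple search. The sketch below looks for the shortest test, in steps of five items, at which a 60% score pulls p(M1|T) below 0.5. The Figure 1A item probabilities are taken from the text; the comparison prior of 0.5 is hypothetical, as are the function and argument names.

```python
def posterior_master(correct, items, prior, p_master, p_nonmaster):
    """p(M1|T) for two mastery states."""
    like_master = p_master ** correct * (1.0 - p_master) ** (items - correct)
    like_nonmaster = p_nonmaster ** correct * (1.0 - p_nonmaster) ** (items - correct)
    joint_master = prior * like_master
    return joint_master / (joint_master + (1.0 - prior) * like_nonmaster)


def shortest_length_below(prior, p_master, p_nonmaster,
                          percent=60, cutoff=0.5, step=5, max_items=200):
    """Smallest test length at which the given percent score drops below cutoff."""
    for items in range(step, max_items + 1, step):
        correct = percent * items / 100
        if posterior_master(correct, items, prior, p_master, p_nonmaster) < cutoff:
            return items
    return None  # the prior was never overturned within max_items


# Strong prior in favor of mastery, as in Figure 1A.
print(shortest_length_below(0.9, 0.9, 0.6))  # 10
# Hypothetical even prior, for comparison.
print(shortest_length_below(0.5, 0.9, 0.6))  # 5
```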
The effect of changing the prior beliefs concerning the proportion
of masters and nonmasters in the examinee population, while holding all
other parameters constant, can be seen by comparing corresponding Graphs
A, B, C, and D in Figures 1, 2, and 3.

The impact of prior information on classification accuracy is very
significant: positively so, if the priors are accurate; and unfavorably,
if the priors are inaccurate. Novick and Lewis (1974) claim that if
the criterion level for mastery is kept constant, then low priors will
require high test scores to convince the (skeptical) decisionmaker that
the examinee has attained the criterion level for mastery. Further,
high priors will allow lower test scores to convince a (less skeptical)
decisionmaker that the examinee had attained the same criterion level
for mastery. In summary, if prior information is strong but inaccurate,
then longer tests will be needed to overcome this bias; but if the
prior information is strong and accurate, then test lengths can be reduced
(by 50%, for example) relative to the number of items that would
be required to reach the same decision with no prior information.

The effect of changing the probability of a correct response,
p(1|Mi), can be seen by comparing Graphs A, B, C, and D for Figures 1,
2, and 3. For example, the only difference between Figure 1A and Figure
1B is that the p(1|M1) changes from 0.9 to 0.8, all other parameters
being held constant. (This change might reflect a lower level of required
proficiency and, hence, less training, for Graph B than for A.
Or perhaps previous test results indicate that masters of the instruction
respond to items with a probability of correct response equal to
0.8 rather than 0.9.) In any case, the effect of this small change in
the p(1|M1) on the p(M|T) is readily apparent. For any test length or
observed test score, the probability of being in the mastery state is
greater in Graph B than in A. This shift is most obvious for the 70%
observed correct curve. Notice that p(M|T) on Graph A for an observed
score of 70% (28 out of 40 correct) is approximately 0.04. However,
the value for p(M|T) in Graph B for 70% of a 40-item test correct is
0.87.
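
The size of this shift can be reproduced from the stated parameter values. The sketch below holds p(M1) = 0.9 and p(1|M2) = 0.6 fixed and changes only p(1|M1) from 0.9 (Graph A) to 0.8 (Graph B); the function name is ours.

```python
def posterior_master(correct, items, prior, p_master, p_nonmaster):
    """p(M1|T) for two mastery states."""
    like_master = p_master ** correct * (1.0 - p_master) ** (items - correct)
    like_nonmaster = p_nonmaster ** correct * (1.0 - p_nonmaster) ** (items - correct)
    joint_master = prior * like_master
    return joint_master / (joint_master + (1.0 - prior) * like_nonmaster)


# 28 out of 40 correct (70%), prior p(M1) = 0.9, p(1|M2) = 0.6 in both graphs.
graph_a = posterior_master(28, 40, prior=0.9, p_master=0.9, p_nonmaster=0.6)
graph_b = posterior_master(28, 40, prior=0.9, p_master=0.8, p_nonmaster=0.6)
print(f"Graph A, p(1|M1) = 0.9: {graph_a:.2f}")  # 0.04
print(f"Graph B, p(1|M1) = 0.8: {graph_b:.2f}")  # 0.87
```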

The main reason for this abrupt change from Graph A to B (in Fig- 
ures 1, 2, and 3) is the lowered requirement for mastery, from 0.9 to 
0.8. The probability that "0.9 persons" score only 70% correct on long 
tests is relatively low. But when masters are defined as those trainees 
who come from a population with a probability of responding correctly
equal to 0.8, the probability of their scoring 70% on a long test is 
high. One of the most difficult jobs for an instructional designer is 
to describe the level of capability required of graduates and the level
of capability actually achieved. Comparison of these graphs indicates 
the magnitude of the effect that these specifications can have on the 
classification of trainees. 
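
The reasoning about how likely a "0.9 person" or a "0.8 person" is to score 70% on a long test rests on the binomial distribution, which follows from the assumption of independent, dichotomously scored items. A minimal sketch (function names are ours) compares the two definitions of a master for a score of 28 out of 40:

```python
from math import comb


def binomial_pmf(correct, items, p):
    """Probability of exactly `correct` right answers on `items` independent items."""
    return comb(items, correct) * p ** correct * (1.0 - p) ** (items - correct)


def prob_at_most(correct, items, p):
    """Probability of scoring `correct` or fewer."""
    return sum(binomial_pmf(k, items, p) for k in range(correct + 1))


for p in (0.9, 0.8):
    print(
        f"p(1|M1) = {p}: P(exactly 28 of 40) = {binomial_pmf(28, 40, p):.5f}, "
        f"P(28 or fewer) = {prob_at_most(28, 40, p):.5f}"
    )
```

Under the 0.8 definition a score of exactly 28 is more than a hundred times as likely as under the 0.9 definition, which is why the mastery classification changes so sharply between the two graphs.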

Graphs C and D of Figures 1, 2, and 3 further illustrate the effect
of variations in the probability of correct responses. The only
difference between Graphs B and C is that the probability of a correct
response from a nonmaster decreases from 0.6 to 0.5. The effect of
this decrease in correct response probability from a nonmaster is to
increase the probability that someone with a score of 70% or 80% will
be a master. Note that the 70% and 80% curves are higher in Graph C
than in B. Not evident from the graphs is the additional result that
nonmasters are less likely to achieve a high score in C than in B, since
p(1|M2) = .6 in B, and p(1|M2) = .5 in C. Finally, Graph D portrays an
extreme case in which neither masters nor nonmasters are responding at
particularly high levels. However, the level of performance for nonmasters
is so low (0.4), that even for observed scores of 60% the probability
of being in the mastery state exceeds 0.8 for all test lengths,
except for 5 and 10 items in Figure 2, and 5, 10, and 20 items in
Figure 3.

Further detailed analysis of these figures is not included in this 
paper. In comparing the 12 graphs against each other, note the magni- 
tude of the changes in p(m|t) when small changes have been made in the 
prior beliefs, in the correct response probabilities, and in the percent 
correct observed responses. The implication is that extreme care must 
be taken when specifying parameters in a Bayesian approach to testing
and decisionmaking. If the parameters are realistic, great savings in
testing time and expense, and increased confidence in decisionmaking 
are possible (Novick & Lewis, 1974). However, if the parameters are 
not realistic, there is a very real danger of misclassifying many ex- 
aminees. The next section of this paper deals with an elaboration of 
the model to three mastery states, thus helping to quantify sources 
of classification error. 

Elaboration to Three Mastery States

Figures 4, 5, 6, and 7 represent cases for which three mastery
states have been hypothesized. In Figures 4 and 6 the probability of
a correct response for a person assumed to be in mastery state M1
equals 0.8; for mastery state M2 this probability is 0.6; and for mastery
state M3, it is 0.5. These values could correspond to the situation
in which the nonmastery group was divided in half. That is, those
persons whose probability of getting any given item correct is 0.5
(comprising mastery state M3) would need extensive retraining; whereas
those whose probability is 0.6 (comprising mastery state M2) would
merely need selective retraining. People in mastery state M1 have a
probability of 0.8 for making a correct response and may therefore be
considered as "masters" who have successfully passed training.
Figure 7. Conditional probability of mastery when there are three prior states of mastery
(values from Figure 6) and three conditional probabilities of answering an item
correctly (values from Figure 5).



For Figures 5 and 7, the corresponding probabilities of a correct
response for people in mastery states M1, M2, and M3 are 0.9, 0.8, and
0.6, respectively. These probabilities might describe a situation in
which the mastery group was dichotomized, perhaps in an attempt to identify
those students who had achieved an exceptionally high level of proficiency,
i.e., p(1|M1) = 0.9.

In Figures 4 and 5, the prior probabilities (or assumed proportions)
of examinees in each mastery state are: p(M1) = 0.5, p(M2) = 0.3, and
p(M3) = 0.2. In Figures 6 and 7, the corresponding prior probabilities
are 0.25, 0.50, and 0.25, respectively. The prior values in Figures 4
and 5 imply a bias toward higher levels of mastery (50% of the examinees
are assumed to be type M1 masters), whereas the bias in Figures
6 and 7 is toward an intermediate level of mastery (50% of the examinees
are assumed to be type M2 masters).

A detailed analysis of Figures 4 and 5 provides the basis for an
interpretation of Figures 6 and 7, which is an exercise left to the
reader. The three graphs labeled A, B, and C represent the probability
that an individual is in mastery state M1, M2, and M3, respectively.
Graph D represents the probability that a person is in mastery state
M1 after mastery states M2 and M3 have been combined into one composite
state.

Graph A in Figure 4 shows the probability that an individual is
in mastery state M1, given observed scores of 60%, 70%, and 80% correct
on 5-, 10-, 20-, and 40-item tests. Thus, for an observed score of 4
out of 5 correct, the probability that this person is in mastery state
M1 is about 0.65. But if this same person scores 32 out of 40 (still
80% correct), the probability that he is an M1 master jumps to 0.98.
These results are similar to those obtained when two mastery groups
were hypothesized, and again illustrate the effect of increasing test
length on the level of confidence in the mastery classification p(M|T).

The probability of being in mastery state M2, given observed
scores, is plotted in Graph B. If a person got 4 out of 5 correct, the
probability of being in state M2 is about 0.25. However, if he got 32
out of 40 correct (still 80% correct), this probability plummets to
0.02. Finally, using these same test score values, Graph C shows that
the probability of being a type M3 master is 0.10 for 4 out of 5 correct,
and nearly zero for 32 out of 40 correct. This result makes intuitive
sense, because only 20% of the examinee population are type M3
(non)masters, and the probability of their getting any item
correct is only 0.50, which is a long way from 80% observed correct.
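
The three-state values quoted for Figure 4 can be recomputed directly from the stated priors (0.5, 0.3, 0.2) and item probabilities (0.8, 0.6, 0.5). The sketch below extends Bayes' Theorem to any number of mutually exclusive and exhaustive states; the function name is ours.

```python
def state_posteriors(correct, items, priors, p_correct):
    """p(Mi|T) for every hypothesized mastery state, given one test score."""
    joint = [
        prior * p ** correct * (1.0 - p) ** (items - correct)
        for prior, p in zip(priors, p_correct)
    ]
    total = sum(joint)
    return [j / total for j in joint]


FIG4_PRIORS = (0.5, 0.3, 0.2)     # p(M1), p(M2), p(M3)
FIG4_P_CORRECT = (0.8, 0.6, 0.5)  # p(1|M1), p(1|M2), p(1|M3)

for correct, items in [(4, 5), (32, 40)]:
    post = state_posteriors(correct, items, FIG4_PRIORS, FIG4_P_CORRECT)
    print(f"{correct}/{items}:", [round(x, 2) for x in post])
# 4/5:   [0.65, 0.25, 0.1]
# 32/40: [0.98, 0.02, 0.0]
```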

Notice that for any given test length and percent correct, the
sum of the probabilities of being in states M1, M2, and M3 equals 1.0.
Comparison of Graphs A, B, and C shows that when either 70% or 80% of
the items for any test length are correctly answered, the probability
of being in state M1 is greater than the probability of being in either
state M2 or M3. That is, both the 70% and 80% curves are higher in
Graph A than in either Graph B or C. For an observed score of 60%,
the probability of being in state M2 is greater than for M1 or M3.
The probability of being in state M3 is rather low for all values of
test length and percent correct observed in this particular example.

Graph D depicts the probability that a person is in mastery state
M1, as opposed to a new nonmastery state composed of both M2 and M3.
It can be seen that when states M2 and M3 have been thus combined, the
probability of being in state M1 is greater than when all three states
were analyzed independently. For observed scores of 70% or 80% correct,
there is slight difference in the decisions that would be made under the
"independence" versus "composite" conditions. However, if a score of
60% were observed, the possibility of distinguishing between M2 and M3
would be lost when those states were combined. This loss of information
may be very important if there is a large difference in cost between
the selective training required for people in the M2 state and
the extensive retraining needed for those in M3. This example also
illustrates the potential significance of maintaining the integrity of
the various nonmastery states. If the instructional decisionmaker knew
the p(M1) with great accuracy and also knew that there were two nonmastery
states, but decided to combine the two states of nonmastery into
just one state, he or she would be throwing away potentially valuable
information. We shall return to this point in the discussion of Figure
5.

The interrelationship between test length and three hypothesized 
mastery states becomes even more apparent in Figure 5. For example, 
Graph A shows that the probability of being in state M1 for 80% correct 
on a 5-item test is about 0.48. The probability of being in state M2 
(shown in Graph B) for 80% correct on a 5-item test is about 0.36. 
There is thus a greater chance that a person whose score is 4 out of 5 
is in M1 (p(M1|T) = 0.48) than in M2 (p(M2|T) = 0.36) or M3 
(p(M3|T) = 0.16). However, if a score of 80% correct were observed 
on a 40-item test, the graphs indicate that a much different decision 
would be appropriate. In this case, p(M1|T) = 0.21, p(M2|T) = 0.78, 
and p(M3|T) = 0.01. Hence, people scoring 32 out of 40 correct 
should be classified as type M2 masters. Also note that a score of 
60% for any test length implies that these people should be placed in 
the M3 state. 
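
The shift from M1 to M2 as the test lengthens can be checked with a short calculation. The sketch below applies Bayes' theorem directly to the number of items right and wrong. The conditional probabilities (0.9, 0.8, 0.6) and p(M1) = 0.5 are the Figure 5 values given in the text; the priors p(M2) = 0.3 and p(M3) = 0.2 are an assumption made here for illustration, chosen because they give values within about 0.01 of those read from the graphs. The function name `posterior` is likewise hypothetical.

```python
def posterior(priors, p_correct, n_items, n_right):
    """Posterior probability of each mastery state, given n_right of n_items correct."""
    n_wrong = n_items - n_right
    # Prior times likelihood of the observed pattern for each state.
    joint = [pr * pc**n_right * (1 - pc)**n_wrong
             for pr, pc in zip(priors, p_correct)]
    total = sum(joint)
    return [j / total for j in joint]


PRIORS = [0.5, 0.3, 0.2]      # p(M2), p(M3) are ASSUMED; only p(M1) = 0.5 is stated
P_CORRECT = [0.9, 0.8, 0.6]   # p(1|M1), p(1|M2), p(1|M3) for Figure 5

short_test = posterior(PRIORS, P_CORRECT, n_items=5, n_right=4)
long_test = posterior(PRIORS, P_CORRECT, n_items=40, n_right=32)

print("5 items, 4 right:  ", [round(p, 2) for p in short_test])
print("40 items, 32 right:", [round(p, 2) for p in long_test])
```

Under these assumptions the most probable state is M1 on the 5-item test and M2 on the 40-item test, although the percent correct is the same in both cases.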

For the data used in Figure 5, the probability of finding M1 type 
masters is overall quite low. Instead, for the levels of achievement 
demonstrated by obtained scores of 60%, 70%, or 80%, it is more likely 
that such scores were produced by people in mastery states M2 
(p(1|M2) = 0.8) and M3 (p(1|M3) = 0.6). 

Graph D in this figure also represents the probability that a person 
is in mastery state M1 as opposed to the new (non)mastery state 
formed by combining states M2 and M3. In this example, most of the 
probabilities in Graph D are lower than in Graph A. A glance back at 
Figure 4, Graphs A and D, reveals that the combination of states M2 
and M3 increased the probability of classifying a person with a given 
test score as a type M1 master. Inspection of the trends in Graphs A 
and D of Figures 4, 5, 6, and 7 suggests that the effect of combining 
mastery states is to enhance the trend of the uncombined state. That 
is, if the probability of being in state M1 is high when the three 
states are treated independently, then p(M1|T) will increase after M2 
and M3 are combined. Conversely, if p(M1|T) is low when the three 
states maintain their integrity, then combining states M2 and M3 tends 
to decrease p(M1|T). 

Flowchart Analysis of How the Bayesian 
Model Was Developed 

The impact of adding a third mastery state to the development of 
the model can be illustrated by tracing the logic that is required in 
formulating a description of the examinee population. (Refer to the 
accompanying flow chart for a schematic summary of this discussion.) The 
first question the decisionmaker must ask (and which we considered) 
is: Are there two or three states of mastery inherent in the examinee 
population (Step A)? If two states are posited, parameter estimates 
for p(M1), p(M2), p(1|M1), and p(1|M2) are specified, along with 
plausible test lengths and values for the percent correct (Step B). The 
output of the Bayesian processing is the probability that a particular 
person is in the mastery state, p(M1|T) (Step D). A unique graph for 
each of Figures 1, 2, and 3 was obtained by holding the prior and 
conditional probabilities constant while simultaneously varying the test 
lengths and percent correct that would plausibly be observed (Step E). 
If three states are hypothesized, parameter estimates for p(M1), p(M2), 
p(M3), p(1|M1), p(1|M2), and p(1|M3) need to be specified, along with 
values for test lengths and percent correct (Step F). 

Now if three states are postulated, a second decision must be 
made (Step G). It would usually seem desirable to determine the 
probabilities of a person's being in each of the three states (Step I). 
Having obtained these probabilities for selected values of prior and 
conditional probabilities and over a range of test lengths and percent 
correct scores, Graphs A, B, and C can be drawn, such as those shown in 
Figures 4, 5, 6, and 7 (Step J). 

However, in some instances it may be more convenient to combine 
the information known about two of the three mastery states. For 
example, even though one mastery state and two nonmastery states are 
hypothesized, the decisionmaking process may require that people be 
divided into only two groups: "mastery" and "nonmastery." In the 
present example, states M2 and M3 were combined (Step K). The result 
of Bayesian processing on these combined data is the probability that 
a person is in the new mastery state (Step M). Iteration of this 
procedure for various test lengths and percent correct scores over the 
same prior and conditional probabilities yields Graph D curves, such 
as those of Figures 4, 5, 6, and 7 (Step N). 

The differences that result from following each of the three paths 
in the flow chart can be seen by comparing Figures 3A, 5A, and 5D. In 
each case the prior probability of being in mastery state M1 was set 
equal to 0.50, and the conditional probability that a type M1 master 
would make a correct response to an item was set equal to 0.90. 
Figure 3A corresponds to path A,B,C,D,E in the flow chart. Figure 5A 
corresponds to path A,F,G,H,I,J; and Figure 5D corresponds to path 
A,F,G,K,L,M,N. 

In Figure 3A, p(1|M2) = 0.6; that is, a nonmaster has a 60% chance 
of correctly responding to an item. However, in Figure 5D the 
nonmastery state is the combination of states M2 and M3, with 
probabilities of responding correctly to an item of 0.8 and 0.6, 
respectively. The effect of combining M2 and M3 is to create a new 
(non)mastery state, where the probability of a correct response is a 
weighted average of the values for the uncombined groups. When a 
relatively high ability intermediate state is defined and then combined 
with a relatively low state, the probability of being in the highest 
mastery state is lower than if that intermediate state remained 
undefined. In fact, if the Figure 5 values of the prior and conditional 
probabilities are valid representations of the "real" states of mastery, 
but the values of Figure 3 (which are a simplification of the Figure 5 
values) are used for decisionmaking, then people achieving scores of 
80% will be falsely classified as type M1 masters. 

The differential trend between Graphs A and D of Figure 5 is 
noteworthy, although the absolute magnitude of the trend is rather small. 
For different parameter estimates (of prior and conditional 
probabilities), the effect of combining groups may be much more 
extensive. Note also that the information provided in Graph D refers 
only to the probability of a person's being in the mastery state and 
does not directly show the loss of information about the two discrete 
nonmastery states that have been combined. Furthermore, when two 
mastery states are combined and contrasted to a third nonmastery state, 
the changes in the probability of being in the newly defined mastery 
state will often be quite different from the probability of being in 
the original mastery state. 

It must be emphasized that unrealistic descriptions of the examinee 
population (in terms of number of mastery groups) can cause severe 
distortions in classification accuracy. For example, had the 
decisionmaker hypothesized only two states when, in fact, training had 
produced three fairly distinct states of proficiency, the results of 
the analysis could be highly misleading. Thus, note that the 80% line 
of Figure 3A ascends as more items are added (i.e., p(M1|T) increases), 
whereas the 80% line of Figure 5D descends (i.e., p(M1|T) decreases) as 
more items are added. 
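
The opposite trends of the two 80% lines can be reproduced with a small two-state calculation. The sketch below is an illustration rather than the report's own program: it treats the combined state of Figure 5D as a single state whose probability of a correct response is the weighted average of the M2 and M3 values, and it assumes priors p(M2) = 0.3 and p(M3) = 0.2, which are not stated in this section. The function name `p_master` is hypothetical.

```python
def p_master(prior_m1, p_m1, p_other, n_items, n_right):
    """Two-state posterior probability of mastery, p(M1|T)."""
    n_wrong = n_items - n_right
    master = prior_m1 * p_m1**n_right * (1 - p_m1)**n_wrong
    other = (1 - prior_m1) * p_other**n_right * (1 - p_other)**n_wrong
    return master / (master + other)


# Figure 3A: a single nonmastery state with p(1|M2) = 0.6.
P_NONMASTER = 0.6
# Figure 5D: M2 (0.8) and M3 (0.6) combined; weights 0.3 and 0.2 are ASSUMED priors.
P_COMBINED = (0.3 * 0.8 + 0.2 * 0.6) / (0.3 + 0.2)

two_state = []
combined = []
for n_items in (5, 10, 20, 40):
    n_right = n_items * 4 // 5          # 80% correct
    two_state.append(p_master(0.5, 0.9, P_NONMASTER, n_items, n_right))
    combined.append(p_master(0.5, 0.9, P_COMBINED, n_items, n_right))
    print(n_items, round(two_state[-1], 3), round(combined[-1], 3))
```

With the nonmaster value at 0.6, a score of 80% lies closer to the master's 0.9 in likelihood terms, so added items raise p(M1|T). With the combined value near 0.72, the same score favors the combined state, so added items lower p(M1|T).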

Caution must also be observed in the opposite case, where one 
might be tempted to specify more states of mastery than are actually 
present, in an effort to extract more information than is justified by 
the test data. 

The present Bayesian model is not limited to three mastery states. 
Exploratory analyses have been conducted with up to five mastery states, 
and it is also hoped that the model can be generalized to deal with 
continuous distributions. 

TEST LENGTH AND MISCLASSIFICATION ERROR 

One of the most important questions that must be answered in 
designing a training evaluation program is "What is the probability of 
falsely classifying a person on the basis of a given observed score?" 
It is also possible to turn the question around and ask "How long must 
a test be, and what score is required for classification decisions to 
be made with some specified lower limit of misclassification?" 

Figures 8 and 9 demonstrate how the Bayesian model can be used to 
answer these two questions. Assuming that the prior and conditional 
probabilities are realistic and fixed, the important variables are then 
test length and cutting score. Suppose that p(M1) = 0.9, p(M2) = 0.1, 
p(1|M1) = 0.9, and p(1|M2) = 0.6, as in Figures 8 and 1A. In this 
example, the prior belief that an untested trainee is a master is very 
high, p(M1) = 0.9. A reasonable question might therefore be "What 
score must be observed such that a nonmastery decision can be made with 
at least 90% confidence?" (In other words, what data are required to 
force a reversal in the prior belief?) 

To be 90% confident of a nonmastery decision, p(M2|T) must be 
equal to at least 0.90. Since the sum of p(M1|T) and p(M2|T) equals 
1.0, p(M1|T) must therefore not be greater than 0.10. Referring to 
Figure 8, a horizontal line crossing the ordinate at 0.10 can be drawn. 
This line crosses the curve for a 5-item test at a point corresponding 
to 26% correct. The next lowest possible test score is one correct 
(20%), so the decision rule is that all persons scoring one correct or 
less should be considered nonmasters. The point on the ordinate 
corresponding to 20% correct on the 5-item test is about 0.05. Hence, 
the final decision rule states that nonmastery decisions based on an 
observed score of 1 correct out of 5 can be made with 95% confidence 
(1.00 - 0.05 = 0.95). For observed scores lower than the cutoff score, 
the confidence in making a correct decision must increase. Continuing 
with the present example, p(M1|T) if zero correct are observed is 
virtually equal to zero. Hence, those persons who get no items right 
may be classified as type M2 nonmasters with nearly 100% confidence. 

Figure 8. Conditional probability of mastery as a function of percent 
correct using the same parameter values as in Figure 1A. 

Figure 9. Conditional probability of mastery as a function of percent 
correct using the same parameter values as in Figure 1D. 
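
The decision rule for Figure 8 can be sketched as a search over the attainable scores. The code below uses the stated parameter values (p(M1) = 0.9, p(1|M1) = 0.9, p(1|M2) = 0.6) and returns the highest number correct at which a nonmastery decision still meets the required confidence. The function names `p_master` and `nonmastery_cutoff` are hypothetical, and the direct use of Bayes' theorem on the counts of right and wrong items is a simplification assumed here, not a transcription of the original program.

```python
def p_master(n_items, n_right, prior_m1=0.9, p_m1=0.9, p_m2=0.6):
    """Posterior probability of mastery, p(M1|T), for a two-state model."""
    n_wrong = n_items - n_right
    master = prior_m1 * p_m1**n_right * (1 - p_m1)**n_wrong
    nonmaster = (1 - prior_m1) * p_m2**n_right * (1 - p_m2)**n_wrong
    return master / (master + nonmaster)


def nonmastery_cutoff(n_items, confidence=0.90):
    """Highest number correct at which p(M2|T) still reaches `confidence`.

    Returns None when no score, not even zero correct, qualifies.
    """
    cutoff = None
    for n_right in range(n_items + 1):
        if 1 - p_master(n_items, n_right) >= confidence:
            cutoff = n_right
    return cutoff


for n_items in (5, 40):
    k = nonmastery_cutoff(n_items)
    print(n_items, "items: classify as nonmaster at", k, "or fewer correct;",
          "p(M1|T) at the cutoff =", round(p_master(n_items, k), 3))
```

A return value of `None` would correspond to a test too short to support a nonmastery decision at the requested confidence.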



A similar analysis applied to the 40-item test curve indicates 
that the cutting score should be about 73% correct. The next lowest 
possible score to 73% is 28 correct out of 40 items, or 70%. The 
probability of mastery, given an observed score of 28 correct, is about 
0.04. At such a low value of p(M1|T) the chances for misclassification 
using a 5-item test and a 40-item test are almost the same. However, the 
observed percent correct at which the nonmastery decision is made for 
the two tests is 20% on the 5-item test and 70% on the 40-item test. 
Superficially, two tests of different lengths would seem to produce the 
same decision outcome, and longer tests may not really be necessary for 
reducing classification error. 

To appreciate the benefits gained from using longer tests, we 
must examine the entire curve. Note that at 80% correct, the 5-item 
test yields a p(M1|T) equal to 0.92. This result means that, on the 
average, 8% of the mastery decisions will be in error, since p(M2|T) 
equals 0.08. For the 40-item test, the probability of mastery, given 
80% correct, is about 0.99. That is, there is only a 1% chance that 
an examinee of nonmastery competence would be incorrectly classified 
as a master. 

A test that distinguishes sharply between masters and nonmasters 
is one in which the probability of mastery is close to either 0.0 or 
1.00 for most obtained scores. On such tests there is only a small 
region in which classification error is large. For example, in 
Figure 8, for the 40-item test the region where p(M1|T) is greater than 
0.1 and less than 0.9 extends from 71% to 77% correct. This means that 
the probability of misclassification (calling a true master a 
"nonmaster," and vice versa) will exceed 0.10 only when observed scores 
range from 71% to 77% correct. In contrast, the region of the 5-item 
test curve for which p(M1|T) is greater than 0.10 and less than 0.90 
extends from about 26% to about 79% correct. Hence, there is a much 
larger region for which the probability of misclassification exceeds 
0.10. Therefore, if classification accuracy is to be maximized over the 
entire range of possible test scores, longer tests are required. 
Ideally, a very long test would produce a step function, for which the 
probability of a given mastery state would be very close to either zero 
or one. 
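
The width of the uncertain region can also be expressed in terms of the scores that can actually occur. The sketch below, again using the Figure 8 parameter values and hypothetical function names, lists the numbers correct for which p(M1|T) falls strictly between 0.10 and 0.90. The percentages quoted in the text are read from continuous curves, so the attainable scores listed here fall inside those ranges rather than at their endpoints.

```python
def p_master(n_items, n_right, prior_m1=0.9, p_m1=0.9, p_m2=0.6):
    """Posterior probability of mastery, p(M1|T), for a two-state model."""
    n_wrong = n_items - n_right
    master = prior_m1 * p_m1**n_right * (1 - p_m1)**n_wrong
    nonmaster = (1 - prior_m1) * p_m2**n_right * (1 - p_m2)**n_wrong
    return master / (master + nonmaster)


def uncertain_scores(n_items, low=0.10, high=0.90):
    """Numbers correct for which neither decision reaches 90% confidence."""
    return [k for k in range(n_items + 1)
            if low < p_master(n_items, k) < high]


for n_items in (5, 40):
    scores = uncertain_scores(n_items)
    share = len(scores) / (n_items + 1)
    print(n_items, "items: uncertain scores", scores,
          "=", round(100 * share), "% of attainable scores")
```

On the 5-item test a third of the attainable scores leave the decision uncertain, whereas on the 40-item test only two of the 41 attainable scores do.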

Figure 9 can be examined in a manner similar to that for Figure 8. 
However, Figure 9 has one outstanding characteristic that merits special 
attention. If nonmastery decisions must be made with 90% confidence, 
and a horizontal line at p(M1|T) = 0.1 is drawn, the line does not 
intersect the curve for the 5-item test. This means that it is not 
possible to classify a nonmaster with 90% confidence if a 5-item test is 
used, given the parameters used in Figure 9. If resource or time 
constraints are such that no more than five items may be given, and if 
the parameter values used in Figure 9 are realistic, and if 90% 
confidence for mastery decisions is required, then there is no reason to 
test. Testing is irrelevant because no matter what score is observed, 
including zero correct, the decision rule compels a mastery decision to 
be made. In fact, for the present values, the probability of mastery, 
given zero correct, is equal to 0.21. This simply means that if persons 
obtaining a score of zero are classified as nonmasters, 21% of 
them will be misclassified, on the average. 

The implication of these results for performance testing is 
obvious. Since performance tests are often rather short, it is essential 
to recognize the magnitude of misclassification error that can be 
incurred with such tests. Designing tests that have a clear and direct 
relation to actual performance is certainly a worthwhile and much-needed 
effort. However, reasonable levels of confidence in classifying 
trainees must not be sacrificed merely for the sake of using 
conveniently short tests. 



SUMMARY AND CONCLUSIONS 



The present simulation study highlights some very pertinent issues 
for test developers and educational decisionmakers. The simulated 
results demonstrate explicitly the effects that changes in the estimates 
of the examinee population quality, number of assumed mastery states, 
criteria required for mastery classification, and test length can have 
on the probability of correctly classifying a particular examinee. 
Furthermore, the simultaneous manipulation of combinations of these 
parameters can produce drastic and complex changes in the probability 
of correctly classifying a specific examinee. 

A unique feature of any Bayesian model is the need for "prior" 
information. In the present context, this is the estimate of the 
proportion of masters and nonmasters in the examinee population. The 
more accurately that such an estimate can be made, the greater the value 
in using a Bayesian approach: "It is this increment in information that 
is equivalent to prior observations which permits a reduction in test 
length when a Bayesian procedure is used" (Novick & Lewis, 1974, p. 
149, italics added). If the number of items or trials that can be 
given on a test is constrained (such as by the cost associated with 
firing live ammunition in tank gunnery or field artillery), then a 
Bayesian model may be desirable. 

The simulation results also demonstrate that a criterion for 
mastery (usually expressed as a percent correct of all possible test 
items that could be given) is not invariant across various test lengths. 
The significant implication is that the probability of correct 
classification varies as a function of test length, mastery criterion, 
and their interaction. Classification accuracy improves with longer 
tests and with stricter mastery criteria. However, there is a point of 
diminishing returns, beyond which increases in test length or criterion 
strictness yield successively smaller increments in classification 
accuracy. 

Another unique feature of the Bayesian approach is that it yields 
the probability of a mastery state, given, or conditional upon, a 
specific examinee's test score. Since the mastery state is 
probabilistically inferred and not assumed, it is not possible to 
compute false positive and false negative error rates. However, the 
model seems to be asking the correct question: "What is the probability 
that a given examinee is a master, given his test score?" An alternative 
binomial model does give the false positive and false negative error 
rates but does not give explicit information about a specific examinee. 
This is because it assumes a certain mastery state and then works 
backwards to compute the misclassification rates for that hypothesized 
mastery state, instead of using prior data to infer the unobservable 
mastery state. 


Hershman's (1971) original formulation of the Bayesian model 
combined several states of nature into a smaller number of states, under 
the assumption that the prior probabilities of the new states were 
equal. This assumption leads to the conclusion that it is generally 
undesirable to combine states of nature (mastery) because of the severe 
distortions in classification accuracy that arise. In contrast, our 
approach was to simply combine the prior probabilities, but not to 
equate them as Hershman did. Hence, p(M1) = .25, p(M2) = .3, and 
p(M3) = .45 would be combined into the values p(M1) = .25 and 
p(M2,3) = .75. This method of combining prior probabilities caused 
relatively little change in classification accuracy, compared to the 
case where the mastery states were processed distinctly. Our approach 
of combining prior information seems more reasonable, since one would 
expect that the probability of one state which is not combined with 
any other should not be affected when the others are combined. This 
may be called an "independence of states of nature" assumption. 

The final, rather significant insight to be gleaned concerns the 
issue of minimal test lengths that are required when limits for the 
probability of misclassification have been specified by the examiner. 
It has been analytically shown that a test can be too short to be of 
any value in decisionmaking, depending upon the misclassification rate 
that the examiner is willing to tolerate. What this model does is to 
show explicitly the risks involved in using a given length of test, 
once the tolerance for misclassification error has been specified by 
the examiner. 

APPENDIX 



A COMPUTATIONAL EXAMPLE FOR THREE MASTERY STATES 



The following example illustrates the computations necessary for 
processing data with the Bayesian model. The values chosen for this 
example correspond to Figure 4. Assume that there are three states of 
mastery, and unequal prior probabilities for these three states. The 
educational decisionmaker must provide estimates for the prior 
probabilities of mastery, p(Mi). For this example let us assume the 
values to be p(M1) = .5; p(M2) = .3; and p(M3) = .2. The decisionmaker 
must also provide estimates for the conditional probability of getting 
any given test item right, given each mastery state. Use the following 
values as the conditional probability of getting an item right, given 
a mastery state: p(1|M1) = .8; p(1|M2) = .6; p(1|M3) = .5. The 
conditional probabilities of getting an item wrong given a mastery state 
are p(0|M1) = .2; p(0|M2) = .4; and p(0|M3) = .5. 

First we need to calculate the probability that an item is answered 
correctly. For the overall population, 

$$p(t_j = \text{correct}) = \sum_{i=1}^{3} p(M_i)\, p(t_j = \text{correct} \mid M_i) = (.5)(.8) + (.3)(.6) + (.2)(.5) = .68.$$



Likewise, 

$$p(t_j = \text{wrong}) = \sum_{i=1}^{3} p(M_i)\, p(t_j = \text{wrong} \mid M_i) = (.5)(.2) + (.3)(.4) + (.2)(.5) = .32.$$



We also need to obtain the set of conditional probabilities for the 
different mastery states, given that an individual item was responded 
to either correctly or wrongly. The general equation is 

$$p(M_i \mid t_j) = \frac{p(M_i)\, p(t_j \mid M_i)}{p(t_j)}.$$

Substituting the above values yields 

$$\begin{aligned}
p(M_1 \mid t_j = \text{correct}) &= (.5)(.8) \div .68 = .588; \\
p(M_2 \mid t_j = \text{correct}) &= (.3)(.6) \div .68 = .265; \text{ and} \\
p(M_3 \mid t_j = \text{correct}) &= (.2)(.5) \div .68 = .147.
\end{aligned}$$



(Note that the sum equals 1.0.) Finally, 

$$\begin{aligned}
p(M_1 \mid t_j = \text{wrong}) &= (.5)(.2) \div .32 = .3125; \\
p(M_2 \mid t_j = \text{wrong}) &= (.3)(.4) \div .32 = .375; \text{ and} \\
p(M_3 \mid t_j = \text{wrong}) &= (.2)(.5) \div .32 = .3125.
\end{aligned}$$



If 6 items were answered correctly on a 10-item criterion-referenced 
test, the following values of the product $\prod_{j=1}^{N} p(M_i \mid t_j)$ result: 

$$M_1 = 3.9 \times 10^{-4}; \quad M_2 = 6.8 \times 10^{-6}; \quad M_3 = 9.6 \times 10^{-8}.$$

Finally, the general Bayesian formula yields the conditional probability 
for each mastery state given the total test score. For example, 

$$p(M_1 \mid T) = \frac{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}}}{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}} + \dfrac{6.8 \times 10^{-6}}{(.3)^{9}} + \dfrac{9.6 \times 10^{-8}}{(.2)^{9}}} = .272.$$

Similar calculations yield p(M2|T) = .474 and p(M3|T) = .254. 
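
The steps of this example can be traced numerically. The sketch below follows the route described above (per-item posteriors, their product over the 10 items, division by the prior raised to the power N - 1, then normalization) and compares it with a direct application of Bayes' theorem to the whole score. The variable names are hypothetical. Because the sketch carries full precision, its results (roughly .275, .469, and .256) differ in the third decimal place from the values above, which are computed from intermediate products rounded to two significant figures.

```python
priors = [0.5, 0.3, 0.2]     # p(M1), p(M2), p(M3)
p_right = [0.8, 0.6, 0.5]    # p(1|M1), p(1|M2), p(1|M3)
n_items, n_right = 10, 6
n_wrong = n_items - n_right

# Marginal probability that a single item is answered right or wrong.
p_item_right = sum(pr * p for pr, p in zip(priors, p_right))
p_item_wrong = sum(pr * (1 - p) for pr, p in zip(priors, p_right))

# Per-item posteriors p(Mi | tj = correct) and p(Mi | tj = wrong).
post_right = [pr * p / p_item_right for pr, p in zip(priors, p_right)]
post_wrong = [pr * (1 - p) / p_item_wrong for pr, p in zip(priors, p_right)]

# Product of the per-item posteriors over all items, for each state.
products = [r**n_right * w**n_wrong for r, w in zip(post_right, post_wrong)]

# The prior was counted once per item; divide out the extra N - 1 copies.
scaled = [prod / pr**(n_items - 1) for prod, pr in zip(products, priors)]
posterior = [s / sum(scaled) for s in scaled]

# Direct route: prior times the likelihood of the whole response pattern.
joint = [pr * p**n_right * (1 - p)**n_wrong for pr, p in zip(priors, p_right)]
direct = [j / sum(joint) for j in joint]

print("products :", ["%.2e" % v for v in products])
print("posterior:", [round(v, 3) for v in posterior])
print("direct   :", [round(v, 3) for v in direct])
```

The two routes agree because the item marginals p(tj) are common to every state and cancel when the results are normalized.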

In order to combine mastery states M2 and M3 into a single mastery 
state (which could represent combining the two degrees of nonmastery, 
Figure 4, Graph D), the following calculations are required. The values 
for p(M1) and $\prod_{j=1}^{N} p(M_1 \mid t_j)$ remain the same, .5 and 
$3.9 \times 10^{-4}$, respectively. 

The new nonmastery state (M2') occurs as a result of combining the 
previous states M2 and M3. Hence, 

$$\begin{aligned}
p(M_2') &= p(M_2) + p(M_3) = .3 + .2 = .5, \\
p(M_2' \mid t_j = \text{correct}) &= p(M_2 \mid t_j = \text{correct}) + p(M_3 \mid t_j = \text{correct}) = .265 + .147 = .412, \text{ and} \\
p(M_2' \mid t_j = \text{wrong}) &= p(M_2 \mid t_j = \text{wrong}) + p(M_3 \mid t_j = \text{wrong}) = .375 + .3125 = .6875.
\end{aligned}$$

Calculation of $\prod_{j=1}^{N} p(M_2' \mid t_j)$ yields 

$$1.09 \times 10^{-3}.$$

Entering these new values into the general Bayesian formula, the 
following values of p(M1'|T) and p(M2'|T) are obtained: 

$$p(M_1' \mid T) = \frac{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}}}{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}} + \dfrac{1.09 \times 10^{-3}}{(.5)^{9}}} = .264,$$

$$p(M_2' \mid T) = \frac{\dfrac{1.09 \times 10^{-3}}{(.5)^{9}}}{\dfrac{3.9 \times 10^{-4}}{(.5)^{9}} + \dfrac{1.09 \times 10^{-3}}{(.5)^{9}}} = .736.$$

Some interesting properties of the model emerge when an alternative 
procedure for combining mastery groups is used. Note that to combine 
two mastery states it is not necessary to calculate new values for 
p(1|M2') and p(0|M2'). However, it is possible to show that these 
values are weighted averages of p(1|M2) and p(1|M3), and of p(0|M2) and 
p(0|M3), respectively, where the weights are the relative proportions 
of the new state accounted for by each of the previous states. The 
calculations follow. 

Since p(M2) = .3 and p(M3) = .2, state M2 accounts for 60% and M3 
accounts for 40% of the new state M2'. Hence, the values are 

p(1|M2') = (.6)p(1|M2) + (.4)p(1|M3) = (.6)(.6) + (.4)(.5) = .56 and 

p(0|M2') = (.6)p(0|M2) + (.4)p(0|M3) = (.6)(.4) + (.4)(.5) = .44. 

Using these new values, 

p(tj = correct) = p(M1')p(1|M1') + p(M2')p(1|M2') 
                = (.5)(.8) + (.5)(.56) = .68 and 

p(tj = wrong) = p(M1')p(0|M1') + p(M2')p(0|M2') 
              = (.5)(.2) + (.5)(.44) = .32. 

Finally, p(M2'|1) and p(M2'|0) may be calculated. 

$$p(M_2' \mid 1) = \frac{p(M_2')\, p(1 \mid M_2')}{p(1)} = \frac{(.5)(.56)}{.68} = .412,$$

and 

$$p(M_2' \mid 0) = \frac{p(M_2')\, p(0 \mid M_2')}{p(0)} = \frac{(.5)(.44)}{.32} = .6875.$$



These values are the same as those obtained by the simple addition pro- 
cedure shown above. 
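
The agreement between the two procedures can be confirmed with a few lines of arithmetic. The sketch below computes the per-item posteriors of the combined state both ways, by simple addition and by the weighted-average route, and then computes p(M1'|T) for 6 items correct out of 10. The dictionary and variable names are hypothetical. At full precision p(M1'|T) comes out near .266, slightly above the .264 obtained earlier from rounded intermediate products.

```python
priors = {"M1": 0.5, "M2": 0.3, "M3": 0.2}
p_right = {"M1": 0.8, "M2": 0.6, "M3": 0.5}

# Marginal probability that a single item is answered right or wrong.
p1 = sum(priors[s] * p_right[s] for s in priors)
p0 = sum(priors[s] * (1 - p_right[s]) for s in priors)

# Procedure 1: add the per-item posteriors of M2 and M3.
add_right = sum(priors[s] * p_right[s] for s in ("M2", "M3")) / p1
add_wrong = sum(priors[s] * (1 - p_right[s]) for s in ("M2", "M3")) / p0

# Procedure 2: form the combined state M2' first, with a weighted-average
# probability of a correct response, then apply Bayes' theorem per item.
prior_c = priors["M2"] + priors["M3"]
p_right_c = (priors["M2"] * p_right["M2"] + priors["M3"] * p_right["M3"]) / prior_c
avg_right = prior_c * p_right_c / p1
avg_wrong = prior_c * (1 - p_right_c) / p0

# Posterior for M1 against the combined state, 6 right and 4 wrong.
# Both priors equal .5 here, so the division by prior**(N - 1) cancels.
m1_right = priors["M1"] * p_right["M1"] / p1
m1_wrong = priors["M1"] * (1 - p_right["M1"]) / p0
prod_m1 = m1_right**6 * m1_wrong**4
prod_c = add_right**6 * add_wrong**4
p_m1_combined = prod_m1 / (prod_m1 + prod_c)

print(round(add_right, 3), round(avg_right, 3))
print(round(add_wrong, 4), round(avg_wrong, 4))
print(round(p_m1_combined, 3))
```

The equality holds for any priors and conditional probabilities, since the weighted average multiplied by the combined prior is simply the sum of the two original prior-times-conditional terms.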

This exercise serves to illustrate the effect of combining two 
mastery states. Combining states M2 and M3 creates, in effect, a new 
description of the examinee population in which only two mastery states 
are hypothesized. The parameter estimates for the new states in this 
example are 

p(M1)   = .5     p(M2)   = .5 
p(1|M1) = .8     p(1|M2) = .56. 

In choosing to combine groups, the decisionmaker must consider whether 
a two-state description of the population with parameter estimates such 
as those is a better representation than the original three-state 
description with parameter estimates 

p(M1)   = .5,    p(M2)   = .3,    p(M3)   = .2, 
p(1|M1) = .8,    p(1|M2) = .6,    p(1|M3) = .5. 